[WIP] Issue 7 : pbs cluster init, class inheritance refactoring#10
Conversation
mrocklin
left a comment
There was a problem hiding this comment.
So looking for comment here as this is my first pull request ever.
Welcome! I hope that it is a positive experience for you. I believe that this work will have some impact, so I hope that you stay involved.
I've added a few quick comments below. They are mostly stylistic rather than substantial. I hope to have more time to try this out on a PBS system soon, although I may not have time until one week from today (going on vacation). Hopefully the others involved in this project will also be able to raise some thoughts here.
| """ | ||
|
|
||
| _submitcmd = 'qsub' | ||
| _cancelcmd = 'qdel' |
There was a problem hiding this comment.
It might be nice to rename these to the more explicit (if perhaps also longer) submit_command and cancel_command and remove the properties that point to them.
There was a problem hiding this comment.
Okay, I'll try this, comming from Java background, it was more natural to me to override kind of methods instead of attributes :).
| if queue != None: header_lines.append('#PBS -q %s' % queue) | ||
| if project != None: header_lines.append('#PBS -A %s' % project) | ||
| if resource_spec != None: header_lines.append('#PBS -l %s' % resource_spec) | ||
| if walltime != None: header_lines.append('#PBS -l walltime=%s' % walltime) |
There was a problem hiding this comment.
Most other dask related projects use the flake8 style checker, which would probably raise some warnings here for combining the if block into a single line. You might want to run the following from the root of this directory:
flake8 dask_jobqueue
But we should probably add this to CI so that we can automate this.
There was a problem hiding this comment.
That's fine with me. Perhaps we could enable Travis CI to do some basic tests and style checking, without having all the docker testing env yet? Should we raise an issue separated from #2 ?
There was a problem hiding this comment.
I've done this in #13. You can also run the same checker locally by running: flake8 -j auto dask_jobqueue
| :param death_timeout: | ||
| :param local_directory: | ||
| :param extra: | ||
| :param kwargs: |
There was a problem hiding this comment.
So far we've been using numpydoc style docstrings: http://dask.pydata.org/en/latest/develop.html#docstrings
| Abstract attribute for job scheduler cancel command | ||
| :return: | ||
| """ | ||
| raise NotImplementedError |
There was a problem hiding this comment.
I'm in favor of removing these properties and just going with class attributes if possible
| job_extra : list | ||
| List of other PBS options, for example -j oe. Passed with '#PBS ' prefix | ||
| local_directory : str | ||
| Dask worker local directory for file spilling. |
There was a problem hiding this comment.
Why remove threads_per_worker above? Also did you want to logically separate the dask options from the PBS options?
There was a problem hiding this comment.
Ah, scrolling down I see that you've done just that. If so then local_directory should probably be an inherited parameter. I think it should be generic.
There was a problem hiding this comment.
Yup I separated options, and for more consistancy with other keywords I used threads instead of thread_per_worker.
local_directory is generic, but I wanted to override it in PBSCluster, because standard PBS installation always provide a $TMPDIR variable which resolve to something like /tmp/{JOB_ID}/. I wanted to make this as the default local_directory for PBSCluster workers.
|
So I've pushed another commit with @mrocklin comments taken care of. Tell me how you feel about it. |
|
Oh, And I've tested on our cluster, and it worked like a charm (even if I did not push the really far). |
jhamman
left a comment
There was a problem hiding this comment.
I generally like how this turned out. I've made a number of small comments, mostly targeting how users will interpret different options.
| self._adaptive = None | ||
|
|
||
| #dask-worker command line build | ||
| self._command_template = "%s/dask-worker %s" % (dirname, self.scheduler.address) |
There was a problem hiding this comment.
slightly safer to write this as:
self._command_template = os.path.join(dirname, 'dask-worker' % self.scheduler.address)| ------- | ||
| A string containing multiple lines to be used as header of job file to be sumitted. | ||
| """ | ||
| raise NotImplementedError |
There was a problem hiding this comment.
Let's just make this a class attribute that defaults to an empty string.
There was a problem hiding this comment.
Okay, I tried to keep one of my "contract" of Abstract Class, but I guess it is still too Java oriented :).
| -------- | ||
| >>> from dask_jobqueue import PBSCluster | ||
| >>> cluster = PBSCluster(project='...') | ||
| >>> cluster = PBSCluster(queue= 'regular', project='DaskOnPBS') |
| death_timeout=60, | ||
| extra='', | ||
| job_extra=[], | ||
| local_directory='$TMPDIR', |
There was a problem hiding this comment.
I have a slight preference to set this to None by default. Either that or os.getenv('TMPDIR', None).
There was a problem hiding this comment.
I believed that $TMPDIR was a PBS default, but apparently it should be PBS_TMPDIR. Can you check on the cluster you are using?
Maybe it is more generic to just remove this arg by default.
There was a problem hiding this comment.
On our system (NCAR's cheyenne), TMPDIR is set to /var/tmp/{JOB_ID}. So given the divergence, I suggest just removing this as a default.
There was a problem hiding this comment.
Then our systems are configured with the same variable (if not set to the same directory), it is difficult to generalise from two examples, but TMPDIR seems to be used for a scratch dir for you and me. Stilled removed it maybe I will add a comment for this.
| threads_per_worker=4, | ||
| processes=9, | ||
| memory='7GB', | ||
| resource_spec='select=1:ncpus=24:mem=100GB', |
There was a problem hiding this comment.
I'm not totally sure what to do here. One option is to leave this as None and force (most) users to set this. Keeping it as is may be okay too but everyone will have to change it (except for you 😉). I have a slight preference for setting it to None.
There was a problem hiding this comment.
Yep I have the same concern. I found your defaults too high, so I set mines, but it is probably wrong too.
On our cluster if we don't configure it, the default will be select=1:ncpus=1:mem=5GB. So worker jobs will probably be killed promptly if dask worker is set with more than 1 thread or process. And I believe it would be really bad for PBS to launch one job for each 1 process dask-worker.
My preference would be more close to what is mentioned in #12, if nothing is provided, resource_spec is computed using dask-worker options. Still not enthusiast about that? It's a good compromise, keep resource_spec keyword, but build it from dask-worker options if not provided. The only concern if for memory limit, there may be a need to extrapolate from String to integer...
So maybe it is simpler to set it to None :)...
There was a problem hiding this comment.
We might inspect available memory and cpu cores using tools like psutil and multiprocessing
In [1]: import psutil
In [2]: psutil.virtual_memory().total
Out[2]: 16680607744
In [3]: psutil.cpu_count()
Out[3]: 4There was a problem hiding this comment.
That may be okay. The challenge is that you'll routinely have the situation where the machine that scheduler is running on is spec'd differently than the machines where the workers will be running on. I think this may be worth doing as a first guess.
There was a problem hiding this comment.
I'm not inclined to base the resource_spec on the submitting node resource...
I would much prefer to guess it from the given threads and processes. Else I would default to None, maybe with a warning?
There was a problem hiding this comment.
These will tend to be different on every machine - what about incorporating a user config file
that would allow local customization?
There was a problem hiding this comment.
@jedwards4b Hmm, could you elaborate?
From my understanding of your suggestion, I would prefer to only have to change a keyword value than having a user config file, but it is just my feeling.
There was a problem hiding this comment.
As for the comment from @mrocklin above, for me this could be put into the JobQueueCluster class, for choosing default processes and/or threads and memory. Maybe we ought to set those to None by default?
Commit incoming for demonstrating the automatic resource_spec assignment.
There was a problem hiding this comment.
If we cannot reach a common view about this, I propose to continue discussing it in another issue, maybe #12 or a new one. Then for this pull request, I can put resource_spec to None by default, remove the solution I proposed with my last commit if you don't like it, and eventually merge all this?
There was a problem hiding this comment.
These values are different from one machine to another, but typically don't need to change from one run to another on the same machine, sure a cli should be provided but a config file would allow the user not to have to remember and add the same cli again and again.
| walltime : str | ||
| Walltime for each worker job. | ||
| job_extra : list | ||
| List of other PBS options, for example -j oe. Passed with '#PBS ' prefix |
There was a problem hiding this comment.
maybe say: Each option will be prepended with the #PBS prefix.
| super(SLURMCluster, self).__init__(name=name, processes=processes, **kwargs) | ||
|
|
||
| #TODO has this been tested? This seems weird to use only processes, and not processes * threads | ||
| # There are no memory limit given to Slurm either? |
There was a problem hiding this comment.
I think the initial implementation left those out but we should provide both. @jedwards4b - do you have a comment here?
| self._cancelcmd = 'scancel' | ||
| # Not used | ||
| 'memory': memory | ||
| } |
There was a problem hiding this comment.
see comment above. I think we'll want to include memory.
| local_directory=None, | ||
| extra='', | ||
| **kwargs | ||
| ): |
There was a problem hiding this comment.
Can we add these options to the individual cluster doc strings?
There was a problem hiding this comment.
Not sure I get your point, where do you want to put them? They are already in the JobQueueCluster doc strings. You wan to put it in every implementation?
There was a problem hiding this comment.
My point was just that the JobQueueCluster doc string doesn't show up on the PBSCluster doc string so it requires some unnecessary searching to find all the possible arguments to the constructor.
There was a problem hiding this comment.
There was already some duplicated doc, I updated it.
Does not feel really good about that though, don't like to duplicate docstrings... Is there a better way?
There was a problem hiding this comment.
You could directly take the class docs and add to them. i.e. in each child class definition you do something like:
doc = JobQueueCluster.__doc__
self.__doc__ += "specific docs for Slurm cluster"
Perhaps there's a less hacky way though.
There was a problem hiding this comment.
@rabernat - do I recall you introducing the xarray devs to a decorator approach to doc string templating? I can't seem to dig it up but perhaps someone knows what I'm talking about.
…reads and processes if not given
| #TODO has this been tested? This seems weird to use only processes, and not processes * threads | ||
| # There are no memory limit given to Slurm either? | ||
|
|
||
| #Keeping template for now has I don't know much about slurm. |
There was a problem hiding this comment.
Comments like these can probably be removed
|
|
||
| def test_basic(loop): | ||
| with PBSCluster(walltime='00:02:00', threads_per_worker=2, memory='7GB', | ||
| with PBSCluster(walltime='00:02:00', threads=2, memory='7GB', |
There was a problem hiding this comment.
I'm curious if this name is perhaps a little ambiguous now. It might be worth it for the shorter name, I'm not sure.
There was a problem hiding this comment.
If it would be me, I would name every corresponding attribute with suffix liker "per_worker" or prefix like "worker_". Like mentioned in #7 (comment).
As it stands, I don't see why threads should be treated differently than memory. Or maybe we should change memory to memory_per_worker, as for processes it actually translates in "number of workers"?
|
In general it's fine with me. I'm also happy to defer to others' judgment. |
|
@jhamman I agree with you. There a lot of improvements to be made, but we see on other issues or PR that we should probably push this as a reference to start working on other improvements, as this one defines a basic implementation strategy for specific Scheduler. |
|
… Sent from my iPhone
On Mar 26, 2018, at 12:26 AM, Joe Hamman ***@***.***> wrote:
@jhamman commented on this pull request.
In dask_jobqueue/core.py:
> +
+ #Following class attributes should be overriden by extending classes.
+ submit_command = None
+ cancel_command = None
+
+ def __init__(self,
+ name='dask-worker',
+ threads=4,
+ processes=6,
+ memory='16GB',
+ interface=None,
+ death_timeout=60,
+ local_directory=None,
+ extra='',
+ **kwargs
+ ):
@rabernat - do I recall you introducing the xarray devs to a decorator approach to doc string templating? I can't seem to dig it up but perhaps someone knows what I'm talking about.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
Thanks @guillaumeeb - I hope to work on this a bit tomorrow so I'm merging now. |
|
@rabernat I did not know about docrep, nice! Just curious have you used docrep in a project already? If that's the case I would be interested in hearing the kind of experience you had with it! |
|
Docrep is used in this module: https://github.com/xgcm/xgcm/blob/master/xgcm/grid.py It's useful when many different docstrings have to share very similar components...avoids repetition and thus potential errors. |
|
Nice, thanks for the link! |
So looking for comment here as this is my first pull request ever.
I've made some significant changes from the master branch:
projectshould not raise #7 first mentioned, I've remove the check on project being defined or not,A question too on SLURMCluster. I've modified its implementation, but I've left the fixed header template logic here, as I am not a SLURM user. Anyway I still have some questioning about the CPU and memory resources in this class : we use only the processes (so nprocs) of dask here, but I would expect to see something like processes * nthreads; moreover, I don't see any memory limit in the template.
Finaly, I've not yet tested this in real condition, only some basic play with job script construction into a python terminal (and by the way I believe we should add some basic unit tests like that, don't you think?).